Clustering with the Basic Sequential Clustering Algorithm

Garrett McCue

Goal of the assignment is to apply BSCA, testing 3 different values for both of the clustering hyperparameters, $\alpha$ and $M$ , to the 1D and 2D projections of the Stellar Classification Dataset from both PCA and LD techniques which was apart of the Dimensionality Reduction assignment.

Table of Contents

1. What is Clustering?

2. BSCA Logic

3. Load and Transform Data

4. BSCA Code

5. 1D Projections with BSCA

6. 2D Projections with BSCA

7. Clustering Analysis

8. Complete Code

What is Clustering?

Clustering is an unsupervised technique that aims to form groupings of data points without any class knowledge. This technique is not classification, because we want to discover groupings of data independent of their respective classes. Since this process is unsupervised it can be common to get results that have a different number of groupings in comparison with the amount of unique classes within the data. Determining what data point belong within each cluster and the number of cluster is based on a chosen distance measurement between data points and the clusters as well as a specified distance threshold and max cluster number. When implementing clustering techniques the high dimensionality of data can make the process difficult. Sometimes it might be necessary to perform dimensionality reduction techniques before the application of a clustering technique. It can also be tricky to find the correct number of clusters as well as choosing the distance threshold. When implementing clustering the threshold of distance ($\alpha$) parameter and max number of clusters ($M$) parameter must be specified. You can initialize the first point as the first cluster and then loop through the rest of the points, or samples until the end. While iterating through the points the minimum distance between each sample point and the cluster(s) will be computed, and if it falls beyond the specified distance threshold and the current cluster total is less than $M$ then a new cluster can be formed from that point. If the minimum distance of a point with one of the clusters cluster falls below the distance threshold, then that point is then added to the cluster it is closest too. This process iterates through all samples adding each one to a cluster or creating a new cluster until all samples distances have been measured and grouped.

BSCA Logic

User defined parameters... $$\alpha = \text{distance threshold}$$ $$ M = \text{maximum cluster count} $$ Initialize the cluster total $t$ as 1 and the first sample, $\vec{X_i}$ as the first cluster, $C_1$ $$ t = 1;\quad \text{where}\; t = \text{cluster total} $$ $$\vec{X_1} = C_1$$ For all samples, $\vec{X_2}...\vec{X_N}$, find the minimum distance to each cluster $d_{min}(\vec{X_i}, C_j)$ $$ d_{min}(\vec{X_i}, C_j)\;; \quad i = 2,...,N $$ If the distance is beyond the threshold $\alpha$, and the cluster total ($t$) is less than the threshold $M$ then create a new cluster from the data point and update the total cluster number. $$ IF \quad d(\vec{X_i}, C_j) > \alpha \quad AND\quad t < M $$ then, $$ t = t + 1 $$ $$ C_t = \vec{X_i} $$ If the distance is not beyond the threshold ($\alpha$) then add that point to the corresponding cluster ($C_j$) with the smallest distance measure $$ IF \quad d(\vec{X_i}, C_j) < \alpha $$ then, $$ C_j = C_j \cup \vec{X_i} $$

Continue finding the minimum distances and assigning samples to the closest cluster or create a new cluster from samples until all samples have been assigned to a cluster

Load and Transform Data

PCA Projection of Data

LDA Projection of Data

BSCA Code

Cluster the data

Functions to aid in clustering evaluation

1D Projections with BSCA

PCA 1D Clustering Results

LDA 1D Clustering Results

2D Projections with BSCA

PCA 2D Clustering Results

LDA 2D Clustering Results

Clustering Analysis

Complete Code